Routing as Statistical Classification

Author

  • Jan Pedersen
Abstract

In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks.

Of the two classical information retrieval tasks [1], document routing is most amenable to machine learning. A fixed, standing query and a training collection of judged documents [2] are provided, and the task is to assess the relevance of a fresh set of test documents. This can clearly be approached as a problem of statistical text classification: documents are to be assigned to one of two categories, relevant or non-relevant, and inference is possible from the labeled documents. In contrast, the classical ad-hoc search problem presumes that only a query and an unlabelled collection are provided.

The standard approach to document routing models document content as a bag-of-words, represented as a sparse, very high-dimensional vector, with one component for each unique term in the vocabulary (Salton, Wong, & Yang 1975). Vector weights are proportional to term frequency and inversely proportional to collection frequency [3]. The general technique is to score test documents with respect to their closeness to the query (also represented as a sparse, high-dimensional vector), where closeness is measured by the cosine between vectors.

Authors listed in alphabetic order.
[1] As defined and evaluated by the TREC conferences (Harman 1994; 1995).
[2] Actually, only a few documents are explicitly labeled, including most of the relevant documents and a few of the irrelevant documents. All other documents are implicitly assumed to be irrelevant.
[3] The exact expression varies across systems, but is typ…
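The vector-space scoring described above can be illustrated with a minimal sketch. The text only says that weights grow with term frequency and shrink with collection frequency, and that the exact expression varies across systems, so the `tf * log(N / df)` weighting used here is an assumption, as are the function names and the toy documents.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, n_docs):
    """Sparse bag-of-words vector as a dict: term -> tf * log(n_docs / df).

    The weighting formula is an assumed, commonly used variant; terms that
    never occur in the collection are dropped.
    """
    counts = Counter(tokens)
    return {
        term: tf * math.log(n_docs / doc_freq[term])
        for term, tf in counts.items()
        if doc_freq.get(term, 0) > 0
    }

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection and query, for illustration only.
docs = [["oil", "price", "rise"], ["oil", "spill"], ["football", "score"]]
doc_freq = Counter(term for doc in docs for term in set(doc))
query_vec = tfidf_vector(["oil", "price"], doc_freq, len(docs))

# Score every document against the query and rank by decreasing score.
scores = [cosine(query_vec, tfidf_vector(d, doc_freq, len(docs))) for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Storing vectors as dictionaries keeps only the non-zero components, which matches the sparse, high-dimensional representation the text describes.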
A modified and expanded query is learned from the training set via Rocchio-expansion Relevance Feedback (Buckley, Salton, & Allan 1994), which essentially constructs a linear combination of the query vector, the centroid of the relevant documents and, occasionally, the centroid of select irrelevant documents [4]. The net result is a scored list of test documents, which may be ranked in decreasing score order for the purposes of presentation and evaluation. Evaluation typically proceeds by averaging precision [5] at a number of recall [6] thresholds.

Rocchio-expansion Relevance Feedback employs a weak learning method. However, the application of stronger methods faces two problems: the …
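The Rocchio expansion mentioned above, a linear combination of the query vector and document centroids, might be sketched as follows. The coefficient values `alpha`, `beta`, and `gamma`, the function names, and the choice to drop terms whose combined weight is not positive are assumptions made for illustration; the source does not specify them.

```python
def centroid(vectors):
    """Component-wise mean of sparse vectors stored as dicts (term -> weight)."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    n = len(vectors)
    return {term: weight / n for term, weight in total.items()} if n else {}

def rocchio(query, relevant, irrelevant=(), alpha=1.0, beta=0.75, gamma=0.15):
    """Expanded query: alpha*query + beta*centroid(relevant) - gamma*centroid(irrelevant).

    The default coefficients are illustrative assumptions. Terms whose
    combined weight is not positive are dropped (also an assumption).
    """
    rel_c = centroid(relevant)
    irr_c = centroid(irrelevant)
    expanded = {}
    for term in set(query) | set(rel_c) | set(irr_c):
        weight = (alpha * query.get(term, 0.0)
                  + beta * rel_c.get(term, 0.0)
                  - gamma * irr_c.get(term, 0.0))
        if weight > 0.0:
            expanded[term] = weight
    return expanded

# Hypothetical training vectors, for illustration only.
expanded_query = rocchio(
    {"oil": 1.0},
    relevant=[{"oil": 1.0, "price": 2.0}, {"price": 2.0}],
    irrelevant=[{"football": 1.0}],
)
```

The expanded query can then be scored against test documents with the same cosine measure used for the original query.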


Similar resources

Comparison and Combination of Statistical and Neural Network Algorithms for Remote-sensing Image Classification

In recent years, the remote-sensing community has become very interested in applying neural networks to image classification and in comparing the performance of neural networks with that of classical statistical methods. These experimental comparisons pointed out that no single classification algorithm can be regarded as a "panacea". The superiority of one algorithm over the other strongly depends ...

Information Retrieval Using Statistical Classification

In the classical information retrieval (IR) problem, the system must find all documents in a collection that are related to a topic defined by a user's query. A common approach to the IR problem is to represent documents and the query as vectors of term frequencies and rank the documents in the collection according to their inner product similarity with respect to the query. When a sample of evaluated...

First order Gaussian graphs for efficient structure classification

First order random graphs as introduced by Wong are a promising tool for structure-based classification. Their complexity, however, hampers their practical application. We describe an extension to first order random graphs which uses continuous Gaussian distributions to model the densities of all random elements in a random graph. These First Order Gaussian Graphs (FOGGs) are shown to have severa...

Feature-based Classification of Time-series Data (Alex Nanopoulos, Rob Alcock)

In this paper we propose the use of statistical features for time-series classification. The classification is performed with a multi-layer perceptron (MLP) neural network. The proposed method is examined in the context of Control Chart Pattern data, which are time series used in Statistical Process Control. Experimental results verify the efficiency of the feature-based classification method, compa...

A Common Lisp Framework for Document Classification and Retrieval

This paper describes the Document Classification Substrate (DCS) and accompanying protocols. The DCS is a framework of Lisp support code facilitating the prototyping and deployment of systems for automatic document classification and retrieval applications. The DCS design reflects the following observations concerning the problem of classification of texts. 1. Initial preprocessing (lexical feature...

A New Probabilistic Model of Text Classification and

This paper introduces the multinomial model of text classification and retrieval. One important feature of the model is that the tf statistic, which usually appears in probabilistic IR models as a heuristic, is an integral part of the model. Another is that the variable length of documents is accounted for, without either making a uniform length assumption or using length normalization. The mult...


Publication date: 1996